
Vace finetuning #3

Open
Tatiana21 wants to merge 54 commits into huvunvidia:main from NeverMore960114:vace_ft

Conversation

@Tatiana21

Code for:

  1. Creating segmentation masks for inpainting tasks on a video dataset, based on the open-sora-plan dataset format
  2. Preprocessing and creating an energon dataset for finetuning with T2V, I2V, and V2V tasks
  3. Finetuning VACE for T2V, I2V, and V2V tasks

Refer to annotators/Inpainting/run_batch_process.sh for segmentation.
Refer to example_commands.sh for commands to process datasets and launch training.
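For context on what the inpainting preprocessing produces: V2V inpainting conditioning typically pairs each clip with a binary mask and blanks out the masked region so the model learns to regenerate it. A minimal sketch of that masking step (the function name and tensor layout are illustrative assumptions, not this PR's actual API):

```python
import torch

def make_inpaint_conditioning(frames: torch.Tensor, mask: torch.Tensor):
    """Zero out masked regions of a clip and keep the mask as conditioning.

    frames: (T, C, H, W) video clip; mask: (T, 1, H, W) with 1 = region to repaint.
    Returns the visible-context frames plus the mask, VACE-inpainting style.
    (Illustrative only -- not the layout used by this PR's preprocessing.)
    """
    masked_frames = frames * (1.0 - mask)  # mask broadcasts over the channel dim
    return masked_frames, mask
```

The actual scripts referenced above additionally generate the segmentation masks themselves and pack the result into energon shards.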

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <[email protected]>

* export env fix

Signed-off-by: yaoyu-33 <[email protected]>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <[email protected]>

* paths fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <[email protected]>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <[email protected]>

* remove debug lines

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
* chore: Add issue template for model requests

Signed-off-by: oliver könig <[email protected]>

* copying over remaining templates

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* update

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <[email protected]>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <[email protected]>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <[email protected]>

* update

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
* ci(fix): pre-flight

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* final

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma provider

Signed-off-by: Ananth Subramaniam <[email protected]>

* patch tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conftest

Signed-off-by: Ananth Subramaniam <[email protected]>

* reenable msc

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <[email protected]>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <[email protected]>

* upload assets

Signed-off-by: Ananth Subramaniam <[email protected]>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <[email protected]>

* lint

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback -s

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* use mcore activations

Signed-off-by: Ananth Subramaniam <[email protected]>

* update test

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix mock

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <[email protected]>

* subclass

Signed-off-by: Ananth Subramaniam <[email protected]>

* update tests

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <[email protected]>

* add sections for releases

Signed-off-by: Ananth Subramaniam <[email protected]>

* improve description

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* exit profiler context

Signed-off-by: Ananth Subramaniam <[email protected]>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* Clear disk space before install check

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <[email protected]>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <[email protected]>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* remove some old recipe files

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe file

Signed-off-by: yaoyu-33 <[email protected]>

* update qwen recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* recipe naming update

Signed-off-by: yaoyu-33 <[email protected]>

* update test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add TypedDict for args

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* update docstring

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <[email protected]>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <[email protected]>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <[email protected]>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update action.yml

Signed-off-by: Yu Yao <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* Add guard / mock for the places needs to download hf config in unit test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add qwen functional test

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
abhinavg4 and others added 24 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
Copilot AI review requested due to automatic review settings December 10, 2025 23:56

Copilot AI left a comment


Pull request overview

This pull request introduces comprehensive support for VACE finetuning, including tools for dataset preparation, preprocessing pipelines, and training infrastructure for T2V (Text-to-Video), I2V (Image-to-Video), and V2V (Video-to-Video) tasks. The implementation extends the existing WAN model architecture with VACE-specific layers and flow-matching training pipelines.

Key Changes

  • Added VACE model architecture with context and base layers for video editing tasks
  • Implemented flow matching training pipeline with configurable timestep sampling strategies
  • Created preprocessing utilities for video/image/mask data and segmentation mask generation
  • Added Gemma model family support (Gemma 1.0 and Gemma 2.0) with proper embedding scaling
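The "proper embedding scaling" in the Gemma bullet refers to Gemma's convention of multiplying token embeddings by sqrt(hidden_size) before the first transformer block. A hedged sketch of that detail (the class name is hypothetical, not the bridge's actual provider code):

```python
import math
import torch
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Token embedding with Gemma-style sqrt(hidden_size) scaling (illustrative)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Gemma scales embeddings up before they enter the first block;
        # omitting this is a common source of conversion mismatches.
        return self.embed(token_ids) * self.scale
```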

Reviewed changes

Copilot reviewed 108 out of 229 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/megatron/bridge/models/wan/wan_layer_spec.py Defines WAN transformer layer specifications including VACE-specific base and context layers with adaptive layer normalization
src/megatron/bridge/models/wan/wan_bridge.py Implements parameter mapping bridges between HuggingFace and Megatron formats for WAN and VACE models
src/megatron/bridge/models/wan/utils/utils.py Provides utility functions for grid size calculation, patching/unpatching, and context parallelism operations
src/megatron/bridge/models/wan/utils/preprocessor.py Implements video and image preprocessing classes with resizing, cropping, and normalization capabilities
src/megatron/bridge/models/wan/rope_utils.py Implements 3D RoPE (Rotary Position Embeddings) for spatial-temporal attention in video models
src/megatron/bridge/models/wan/modules/vae.py Defines VAE encoder/decoder architecture with causal 3D convolutions for video latent encoding
src/megatron/bridge/models/wan/modules/tokenizers.py Provides HuggingFace tokenizer wrapper with text cleaning utilities
src/megatron/bridge/models/wan/modules/t5.py Implements T5 encoder/decoder models with custom layer normalization and attention mechanisms
src/megatron/bridge/models/wan/flow_matching/time_shift_utils.py Implements timestep sampling strategies and sigma computation for flow matching training
src/megatron/bridge/models/wan/flow_matching/flow_pipeline.py Defines training pipeline for flow matching with support for both WAN and VACE models
src/megatron/bridge/models/wan/flow_matching/flow_inference_pipeline.py Implements inference pipeline with DPM/UniPC solvers and pipeline parallelism support
src/megatron/bridge/models/wan/inference/configs/*.py Configuration files for different WAN model variants (T2V, I2V, VACE) with size-specific settings
src/megatron/bridge/models/gemma/*.py Adds complete Gemma model family support with proper embedding scaling and configuration mappings


Comment on lines +232 to +233
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors

Copilot AI Dec 10, 2025


Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors

Comment on lines +359 to +360
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors

Copilot AI Dec 10, 2025


Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors


class CausalConv3d(nn.Conv3d):
"""
Causal 3d convolusion.

Copilot AI Dec 10, 2025


Corrected spelling of 'convolusion' to 'convolution' in docstring.

Suggested change
Causal 3d convolusion.
Causal 3d convolution.

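For readers unfamiliar with the class under review: a causal 3D convolution pads the temporal axis only with past frames, so the output at frame t never depends on future frames, while the spatial axes are padded symmetrically as usual. A hedged, self-contained sketch of the idea (not the repo's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3dSketch(nn.Conv3d):
    """3D convolution that is causal in time (illustrative, not the repo's class).

    Spatial dims get symmetric zero padding; the temporal dim is padded only
    in front (the past), so frame t depends solely on frames <= t.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, **kwargs)
        kt, kh, kw = self.kernel_size
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom, t_front, t_back)
        self._causal_pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return super().forward(F.pad(x, self._causal_pad))
```

This property is what lets a video VAE encode frames streamingly without peeking at future content.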
Comment on lines +1014 to +1018
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):

Copilot AI Dec 10, 2025


Parameter names should follow snake_case convention. These should be input_frames, input_masks, and input_ref_images instead of capitalized versions.

Suggested change
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
input_frames (`list[Tensor]`):
Input frames for content generation
input_masks (`list[Tensor]`):
Input masks for content generation
input_ref_images (`list[Tensor]`):

"""Configuration for a 2B parameter Code Gemma model.

Extends GemmaModelProvider with specific settings for code generation.
Thism model has an identical configuration to GemmaModelProvider2B.

Copilot AI Dec 10, 2025


Corrected spelling of 'Thism' to 'This' in docstring.

Suggested change
Thism model has an identical configuration to GemmaModelProvider2B.
This model has an identical configuration to GemmaModelProvider2B.


8 participants